Abstract

Frequently, a set of objects has to be evaluated by a panel of assessors, but not 
every object is assessed by every assessor. A problem facing such panels is how to 
take into account different standards amongst panel members and varying levels of 
confidence in their scores. Here, a mathematically-based algorithm is developed to 
calibrate the scores of such assessors, addressing both of these issues. The algorithm 
is based on the connectivity of the graph of assessors and objects evaluated, incorporating
declared confidences as weights on its edges. If the graph is sufficiently well
connected, relative standards can be inferred by comparing how assessors rate objects
they assess in common, weighted by the levels of confidence of each assessment.
By removing these biases, “true” values are inferred for all the objects. Reliability
estimates for the resulting values are obtained. The algorithm is tested in two
case studies, one by computer simulation and another based on realistic evaluation 
data. The process is compared to the simple averaging procedure in widespread 
use, and to Fisher’s additive incomplete block analysis. It is anticipated that the 
algorithm will prove useful in a wide variety of situations such as evaluation of the 
quality of research submitted to national assessment exercises; appraisal of grant 
proposals submitted to funding panels; ranking of job applicants; and judgement of 
performances on degree courses wherein candidates can choose from lists of options. 


Keywords: Calibration, evaluation, assessment, confidence, uncertainty, model comparison.

1 Introduction


This paper addresses the widespread problem of how to take into account differences in 
standards, confidence and bias in assessment panels, such as those evaluating research 
quality or grant proposals, employment or promotion applications, and classification 
of university degree courses, in situations where it is not feasible for every assessor to 
evaluate every object to be assessed. 

A common approach to assessment of a range of objects by such a panel is to assign 
to each object the average of the scores awarded by the assessors who evaluate that 
object. This approach is represented by the cell labelled “simple averaging” (SA) in the 
top left of a matrix of approaches listed in Table 1, but it ignores the likely possibility 
that different assessors have different levels of stringency, expertise and bias [1]. Some 
panels shift the scores for each assessor to make the average of each take a normalised 
value, but this ignores the possibility that the set of objects assigned to one assessor may 
be of a genuinely different standard from that assigned to another. For an experimental 
scientist, the issue is obvious: calibration. 

One solution is to seek to calibrate the assessors beforehand on a common subset 
of objects, perhaps disjoint from the set to be evaluated [2]. This means that they 
each evaluate all the objects in the subset and then some rescaling is agreed to bring 
the assessors into line as far as possible. This would not work well, however, in a 
situation where the range of objects is broader than the expertise of a single assessor. 
Also, regardless of how well the assessors are trained, differences between individuals’ 
assessments of objects remain in such ad hoc approaches [3]. 

If the expertise of two assessors overlaps on some subject, however, any discrepancy
between their evaluations can be used to infer information about their relative standards.
Thus if the graph $\hat{T}$ on the set of assessors, formed by linking two whenever they assess
a common object, is sufficiently well connected, one can expect to be able to infer a robust
calibration of the assessors and hence robust scores for the objects. The construction of
this graph is illustrated in Figure 1, beginning from the graph T showing which objects
are assessed by which assessors.

Table 1: Panel Assessment Methods: The matrix of four approaches according to use
of calibration and/or confidences. Simple averaging (SA) is the base for comparisons.
Fisher’s IBA does not deal with varying degrees of confidence and the confidence-weighted
averaging does not achieve calibration. The method proposed herein (CWC)
accommodates both calibration and confidences.

Figure 1: Three examples of assessment graphs T showing which object $o_j$ is assessed by
which assessor $a_i$, and the resulting graphs $\hat{T}$ on the set of assessors where two assessors
are linked if they assess an object in common. Case (a) produces a fully connected
assessor graph, (b) a moderately connected graph, whereas case (c) is disconnected.

One approach to achieving such calibration was developed by R. A. Fisher [3], in the
context of trials of crop treatments. Denoting the score from assessor a for object o by
$s_{ao}$, Fisher’s approach is based on fitting a model of the form $s_{ao} = v_o + b_a + \varepsilon_{ao}$ with
$\varepsilon_{ao}$ independent identically distributed random variables of mean zero. Then $b_a$ is the
bias inferred for assessor a and $v_o$ is the value inferred for object o. Fisher’s approach
is known as additive incomplete block analysis (IBA) and a body of associated literature
and applications has since been developed [5], though its use in panel assessment seems
rare. It is represented as the bottom left entry of Table 1.

Another ingredient that is important in many panel assessments, however, is the different
weights that may be put on different assessments. We refer to these weights as
“confidences”. Fisher’s IBA does not take different levels of confidence into account. 

If the assessors express confidences in the assessments, for example by some predetermined
weights assigned to types of assessment or by the assessors declaring confidences
in each of their scores, then it is natural to replace simple averaging by confidence-weighted
averaging (CWA). This is represented as the top right element of Table 1, but
it does not address the calibration issue, so we do not consider it further.

In this paper we present and test a method to calibrate scores taking into account 
confidences, that is, we complete the bottom-right corner of the matrix of approaches 
represented in Table 1, where our method is termed calibration with confidence (CWC). 
We demonstrate that the method can achieve a greater degree of accuracy with fewer 
assessors than either simple averaging or IBA, and we derive robustness estimates taking 
the confidences into account. 

We are aware of two other schemes that incorporate confidences into a calibration 
process. One is the abstract-review method for the SIGKDD’09 conference (section 4
of [6]; see also [7]). The other is the abstract-review method used for the NIPS2013
conference (building on [8] and described in [9]). Our method has the advantages of simplicity
of implementation and a straightforward robustness analysis. We leave detailed
comparison with methods such as these for future publication. 

2 The model 

Let us suppose that each assessor is assigned a subset of the objects to evaluate. Denote 
the resulting set of (assessor, object) pairs by E. Let us further suppose that the score
$s_{ao}$ that assessor a assigns to object o is a real number related to a “true” value $v_o$ for
the object by

$$ s_{ao} = v_o + b_a + \varepsilon_{ao}, \tag{1} $$

where $b_a$ can be called the bias of assessor a and $\varepsilon_{ao}$ are independent zero-mean random
variables. Such a model forms the basis for additive incomplete block analysis. This was
also proposed in ref. [10] (see equation (8.2b) therein) but without a method to estimate
the true values. Here we will achieve this and make a significant improvement, namely
the incorporation of varying confidences in the scores.

To take into account the varying expertise of the assessors with respect to the objects,
we propose that in addition to the score $s_{ao}$, each assessor is asked to specify a
level of confidence for that evaluation. This could be in the form of a rating such as
“high”, “medium”, “low”, as requested by some funding agencies, but we propose to
allow something more general and akin to experimental science. Confidence can be estimated
by asking assessors to specify an uncertainty $\sigma_{ao} > 0$ for their score and then
the confidence level (or “precision”) is taken to be

$$ c_{ao} = 1/\sigma_{ao}^2. \tag{2} $$

The instructions to the assessors can be to choose $s_{ao}$ and $\sigma_{ao}$ so that 2/3 of their
probability distribution for the score lies in $[s_{ao} - \sigma_{ao}, s_{ao} + \sigma_{ao}]$, 1/6 above this interval
and 1/6 below it. Methods for training assessors to estimate uncertainties are presented
in [11]. There are also methods for training assessors on the assessment criteria to
improve their accuracy [12], which could also be expected to have the beneficial effect of
reducing their uncertainties.

So let us suppose that

$$ \varepsilon_{ao} = \sigma_{ao} \eta_{ao}, \tag{3} $$

with $\eta_{ao}$ independent zero-mean random variables of common variance w. For the
moment, we set w = 1; extensions to other values of w are considered in Appendix
A, and in particular are necessary if confidence is expressed only qualitatively. In the
case that confidences are reported as only high, medium or low, they can be converted
into quantitative ones by, for example, choosing $A \approx 2$ and setting $c_{ao} = A^2, 1, A^{-2}$,
respectively. The interpretation of A is the ratio of the uncertainty for a low confidence
evaluation to that for a medium one, and for a medium one to a high one. Then w is
unspecified but can be fit from the data, as in Appendix A.
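
As an illustration of the two ways of obtaining confidences described above, the following is a minimal sketch in Python with NumPy. The function names and the dictionary of labels are hypothetical, introduced only for this example; the default A = 2 is the illustrative value mentioned in the text.

```python
import numpy as np

def confidence_from_uncertainty(sigma):
    """Equation (2): confidence c = 1 / sigma**2 for uncertainties sigma > 0."""
    sigma = np.asarray(sigma, dtype=float)
    if np.any(sigma <= 0):
        raise ValueError("uncertainties must be strictly positive")
    return 1.0 / sigma**2

def confidence_from_label(label, A=2.0):
    """Map a qualitative level to A**2, 1 or A**-2 (A = ratio of uncertainties)."""
    table = {"high": A**2, "medium": 1.0, "low": A**-2}
    return table[label]

# Quantitative uncertainties of 5, 10 and 15 points
print(confidence_from_uncertainty([5.0, 10.0, 15.0]))
# Qualitative levels with A = 2
print([confidence_from_label(x) for x in ("high", "medium", "low")])
```

With qualitative labels only the ratios of the confidences are fixed, which is why the overall scale w has to be fitted separately.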

Thus our basic model is

$$ s_{ao} = v_o + b_a + \sigma_{ao} \eta_{ao}. \tag{4} $$


3 Solution of the model 

Given the data $\{(s_{ao}, \sigma_{ao}) : (a, o) \in E\}$ for all assigned assessor-object pairs, we wish to
extract the true values $v_o$ and assessor biases $b_a$. The simplest procedure is to minimise
the sum of squares

$$ \sum_{(a,o) \in E} \eta_{ao}^2 = \sum_{(a,o) \in E} c_{ao} (s_{ao} - v_o - b_a)^2, \tag{5} $$

where the confidence level $c_{ao}$ was defined in Equation (2). This procedure can be
justified if the $\eta_{ao}$ are assumed to be normally distributed, because then it gives the
maximum-likelihood values for $v_o$ and $b_a$. It can also be viewed as orthogonal projection
of the vector s of scores $s_{ao}$ to the subspace of the form $s_{ao} = v_o + b_a$ in the Riemannian
metric given by $|s| = \sqrt{\sum_{ao} c_{ao} s_{ao}^2}$.

Now expression (5) is minimised with respect to $v_o$ iff

$$ \sum_{a : (a,o) \in E} c_{ao} (s_{ao} - v_o - b_a) = 0, $$

and with respect to $b_a$ iff

$$ \sum_{o : (a,o) \in E} c_{ao} (s_{ao} - v_o - b_a) = 0. $$

It is notationally convenient to extend the sums to all assessors (respectively objects) by
assigning the value $c_{ao} = 0$ to any assessor-object pair that is not in E (i.e. for which a
score was not returned). Then these conditions can be written as

$$ C_o v_o + \sum_a b_a c_{ao} = V_o, \tag{6} $$

$$ \sum_o c_{ao} v_o + C'_a b_a = B_a. \tag{7} $$

Here,

$$ V_o = \sum_a c_{ao} s_{ao} \tag{8} $$

is the confidence-weighted total score for object o and

$$ B_a = \sum_o c_{ao} s_{ao} \tag{9} $$

is that for assessor a,

$$ C_o = \sum_a c_{ao} \tag{10} $$

is the total confidence in the assessment of object o and

$$ C'_a = \sum_o c_{ao} \tag{11} $$

is the total confidence expressed by assessor a.

Equations (6) and (7) form a linear system of equations for the $v_o$ and $b_a$. It has an
obvious degeneracy in that one could add a constant k to all the $v_o$ and subtract k from
all the $b_a$ and obtain another solution. One can remove this degeneracy by, for example,
imposing the condition

$$ \sum_a b_a = 0. \tag{12} $$

This is the simplest possibility and corresponds to a translation (shift) that brings the
average bias over assessors to zero. Alternatives are discussed in Appendix B.
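
The linear system (6), (7) with the degeneracy-breaking condition (12) can be assembled and solved directly. The following is a minimal sketch in Python with NumPy, not the authors' implementation: the function name `solve_cwc`, the dense array layout (assessors as rows, objects as columns, zero confidence for missing scores) and the small noise-free example are assumptions made for illustration. It presumes that the assessor-object graph is connected.

```python
import numpy as np

def solve_cwc(scores, conf):
    """Solve equations (6), (7) subject to sum_a b_a = 0, equation (12).

    scores, conf: arrays of shape (N_a, N_o); conf[a, o] = 0 where assessor a
    returned no score for object o (the score entry is then ignored).
    Returns (v, b): object values and assessor biases.
    """
    c = np.asarray(conf, dtype=float)
    s = np.where(c > 0, scores, 0.0)
    n_a, n_o = c.shape
    C_o = c.sum(axis=0)            # (10) total confidence per object
    C_a = c.sum(axis=1)            # (11) total confidence per assessor
    V = (c * s).sum(axis=0)        # (8)
    B = (c * s).sum(axis=1)        # (9)
    # Block form of (6) and (7), acting on the stacked vector (v, b)
    L = np.block([[np.diag(C_o), c.T],
                  [c, np.diag(C_a)]])
    rhs = np.concatenate([V, B])
    # Replace the last equation by condition (12): sum of biases is zero
    L[-1, :] = 0.0
    L[-1, n_o:] = 1.0
    rhs[-1] = 0.0
    x = np.linalg.solve(L, rhs)
    return x[:n_o], x[n_o:]

# Hypothetical noise-free example: 3 assessors, 3 objects, 2 assessors per object
v_true = np.array([50.0, 60.0, 70.0])
b_true = np.array([5.0, -5.0, 0.0])
conf = np.array([[4.0, 1.0, 0.0],
                 [0.0, 1.0, 0.25],
                 [1.0, 0.0, 4.0]])
scores = v_true[None, :] + b_true[:, None]
v, b = solve_cwc(scores, conf)
print(v, b)
```

Because the example scores contain no noise and the biases sum to zero, the solver recovers `v_true` and `b_true` up to rounding error.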

Define a graph T linking assessor a to object o if and only if $(a, o) \in E$, as illustrated
in the left column of Figure 1. The edges in the graph are weighted by the confidences
$c_{ao}$. Whether the set of equations (6) and (7) has a unique solution after breaking the
degeneracy depends on the connectivity of T. Define a linear operator L by writing
equations (6) and (7) as

$$ L \begin{pmatrix} v \\ b \end{pmatrix} = \begin{pmatrix} V \\ B \end{pmatrix}, \tag{13} $$

where v, b, V and B denote the column vectors formed by the $v_o$, $b_a$, $V_o$ and $B_a$ respectively.
The operator L has null space of dimension equal to the number of connected
components of T (this follows from Perron-Frobenius theory, see e.g. ref. [13]). Thus if T
is connected, the null space of L has dimension one, so corresponds precisely to the null
vectors $v_o = k \ \forall o$, $b_a = -k \ \forall a$, that we already noticed and dealt with. Connectedness
of T ensures that if (13) has a solution then there is a unique one satisfying (12).
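
The statement about the null space can be checked numerically on small examples. The sketch below, in Python with NumPy, builds the operator L from a confidence matrix and counts its (numerically) zero eigenvalues; the helper names and the tolerance are assumptions made for this illustration.

```python
import numpy as np

def operator_L(conf):
    """Matrix of the operator L in (13) for confidences conf[a, o]."""
    c = np.asarray(conf, dtype=float)
    return np.block([[np.diag(c.sum(axis=0)), c.T],
                     [c, np.diag(c.sum(axis=1))]])

def nullity(conf, tol=1e-9):
    """Dimension of the null space of L, counted via near-zero eigenvalues."""
    eig = np.linalg.eigvalsh(operator_L(conf))   # L is symmetric
    scale = max(1.0, np.abs(eig).max())
    return int(np.sum(np.abs(eig) < tol * scale))

# Two assessors sharing one object: the graph T is connected
connected = [[1.0, 1.0, 0.0],
             [0.0, 1.0, 1.0]]
# Two assessors with no object in common: T has two components
disconnected = [[1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0]]
print(nullity(connected), nullity(disconnected))
```

A nullity greater than one signals that the panel splits into groups whose standards cannot be compared with each other.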
It remains to check that the right-hand side of equation (13) lies in the range of L,
thus ensuring that a solution exists. This is true if all null forms of the adjoint operator
$L'$ send the right-hand side to zero. The null space of $L'$ has the same dimension as
that of L, because L is square, and an obvious non-zero null form $\alpha$ is given by

$$ \alpha(v, b) = \sum_o v_o - \sum_a b_a. \tag{14} $$

It follows from the definitions of V and B that $\alpha(V, B) = 0$. So a solution exists.
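
Both properties of the form (14) can be verified in a few lines from the definitions (8) to (11). The derivation below is a sketch added for the reader and uses only the symbols already defined; the first computation shows that the form vanishes on the range of L, the second that it vanishes on the right-hand side of (13).

```latex
\begin{align*}
\alpha\bigl(L(v,b)\bigr)
  &= \sum_o \Bigl( C_o v_o + \sum_a c_{ao} b_a \Bigr)
   - \sum_a \Bigl( \sum_o c_{ao} v_o + C'_a b_a \Bigr) \\
  &= \sum_o C_o v_o + \sum_a C'_a b_a - \sum_o C_o v_o - \sum_a C'_a b_a = 0, \\
\alpha(V,B)
  &= \sum_o \sum_a c_{ao} s_{ao} - \sum_a \sum_o c_{ao} s_{ao} = 0.
\end{align*}
```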

Thus under the assumption that the assessor-object graph T is connected, equations
(6) and (7) have a unique solution (v, b) satisfying equation (12). Note that connectedness
of T is necessary for uniqueness, otherwise one could follow an analogous procedure,
adding and subtracting constants independently in each connected component of T, and
thereby produce more solutions.

The equations (6) and (7) have a special structure, due to the bipartite nature of T, that
can be worth exploiting. The first equation (6) can be written as

$$ v_o = \frac{V_o - \sum_a b_a c_{ao}}{C_o}. \tag{15} $$

This can be substituted into the second equation (7) to obtain

$$ \sum_{a'} c_{aa'} b_{a'} - C'_a b_a = \sum_o \frac{c_{ao} V_o}{C_o} - B_a, \tag{16} $$

where

$$ c_{aa'} = \sum_o \frac{c_{ao} c_{a'o}}{C_o} \tag{17} $$

can be considered as weights on the edges of the graph on assessors illustrated in
the right column of Figure 1. The dimension of the reduced system (16) is the number
$N_a$ of assessors (rather than the sum of the numbers of assessors and objects), which
gives some computational savings. Replacing one of the equations in (16), say that for
the “last” assessor, by equation (12) gives a system with a unique solution that can be
solved for b by any method of numerical linear algebra, e.g. LUP decomposition [14].
Then v can be obtained from equation (15).
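
The reduction just described can be sketched as follows in Python with NumPy. The function name `solve_reduced` and the small noise-free example are assumptions for illustration; the sketch presumes a connected assessor graph and that every object has at least one assessment, so that all $C_o > 0$. NumPy's `linalg.solve` relies on an LU factorisation with partial pivoting.

```python
import numpy as np

def solve_reduced(scores, conf):
    """Solve the reduced system (16) for b, then recover v from (15)."""
    c = np.asarray(conf, dtype=float)
    s = np.where(c > 0, scores, 0.0)
    C_o = c.sum(axis=0)              # (10)
    C_a = c.sum(axis=1)              # (11)
    V = (c * s).sum(axis=0)          # (8)
    B = (c * s).sum(axis=1)          # (9)
    c_aa = (c / C_o) @ c.T           # (17): sum_o c_ao c_a'o / C_o
    M = c_aa - np.diag(C_a)          # left-hand side of (16)
    rhs = (c / C_o) @ V - B          # right-hand side of (16)
    # Replace the equation for the last assessor by condition (12)
    M[-1, :] = 1.0
    rhs[-1] = 0.0
    b = np.linalg.solve(M, rhs)
    v = (V - c.T @ b) / C_o          # (15)
    return v, b

# Hypothetical noise-free example: 3 assessors, 3 objects
v_true = np.array([50.0, 60.0, 70.0])
b_true = np.array([5.0, -5.0, 0.0])
conf = np.array([[4.0, 1.0, 0.0],
                 [0.0, 1.0, 0.25],
                 [1.0, 0.0, 4.0]])
scores = v_true[None, :] + b_true[:, None]
v, b = solve_reduced(scores, conf)
print(v, b)
```

Only an $N_a \times N_a$ system is factorised here, which is the computational saving mentioned above when the number of objects greatly exceeds the number of assessors.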

A slightly more sophisticated approach to incorporating a degeneracy-breaking condition
into the equations (13) is described in Appendix B.

A key question with any black-box solution like the one presented here is how robust 
is the outcome? We propose two ways of quantifying the robustness. One is to bound how 
much the outcome would change if some of the scores were changed (e.g. representing 
mistakes or anomalous judgements). We treat this in Appendix C. The other is to 
evaluate the posterior uncertainty of the outcomes, assuming normal distribution of the
$\eta_{ao}$. This is treated in Appendix D.

4 Case Studies


We have tested the approach in three contexts. We report in detail on two case studies 
here. In the first case study, we use a computer-generated set of data containing true 
values of assessed items, assessor biases and confidences for the assessments, and resulting 
scores. This has the advantage of allowing us to compare the values obtained by the 
new approach with the true underlying value of each item. The second case study is 
an evaluation of grant proposals using realistic data based on a university’s internal 
competition. In this test, of course, there is no possibility to access “true” values, so 
instead we compare the evidence for the models using a Bayesian approach (Appendix 
E), and we compare their posterior uncertainties (Appendix D). The third context in 
which we tested our method was assessment of students; we report briefly on this at the 
end of the section. 

4.1 Case Study 1 — Simulation 

In the simulation, $N_o = 3000$ objects are assessed by a panel of $N_a = 15$ assessors. This
choice was motivated by the number of outputs and reviewers in the applied mathematics
unit of assessment at the UK’s 2008 research assessment exercise. The simulation was
carried out using MATLAB, and the system of equations was solved using its built-in
procedure, which computed the LU decomposition of L (with the last row replaced by
the degeneracy-breaking condition (12)). The reduction to (16) was not used because
$N_o = 3000$ is easily handled by modern personal computers.

True values of the items $v_o$ were assumed to be normally distributed with a mean
of 50 and standard deviation set to 15, but with $v_o$ values truncated at 0 and 100.
The assessor biases $b_a$ were assumed to be normally distributed with a mean of 0 and a
standard deviation of 15. Each assessment was considered to be done with high, medium,
or low confidence, and these were modelled using scaled uncertainties for the awarded
scores, of $\sigma_{ao} = 5$, 10 or 15 respectively. The allocated scores follow equation (4), but
truncated at 0 and 100.

With r assessors per item (which we took to be the same for each item in this
instance), each simulation generated $rN_o$ object scores $s_{ao}$. From these, we generated
$N_o$ value estimates $\hat{v}_o$ and $N_a$ estimates of assessor biases $\hat{b}_a$ using the calibration
processes. We then took the mean and maximum values of the errors in the estimates,
$dv_o = |\hat{v}_o - v_o|$ and $db_a = |\hat{b}_a - b_a|$. Simple averaging also delivered a value estimate $\hat{v}_o$, as
well as mean and maximal values of the errors $dv_o$. Finally, we determined the averages
of the errors $dv_o$ and $db_a$ over 100 simulations. The results for these averaged mean and
maximal errors in the scores are denoted by $\langle dv \rangle$ and $\langle dv \rangle_{max}$, respectively, and those for
the biases (for the calibrated approaches only) are denoted $\langle db \rangle$ and $\langle db \rangle_{max}$.
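
A single run of such a simulation can be sketched in Python with NumPy as below. This is an illustrative reconstruction, not the MATLAB code used for the paper: the function names, the reduced problem size (300 objects rather than 3000), the random assignment of assessors, the seed, and the centring of the simulated biases (so that they satisfy condition (12)) are all assumptions of the sketch.

```python
import numpy as np

def solve_cwc(scores, conf):
    """Solve (6), (7) with sum of biases zero; conf[a, o] = 0 means no score."""
    c = np.asarray(conf, dtype=float)
    s = np.where(c > 0, scores, 0.0)
    n_a, n_o = c.shape
    L = np.block([[np.diag(c.sum(axis=0)), c.T],
                  [c, np.diag(c.sum(axis=1))]])
    rhs = np.concatenate([(c * s).sum(axis=0), (c * s).sum(axis=1)])
    L[-1, :] = 0.0
    L[-1, n_o:] = 1.0      # last row replaced by condition (12)
    rhs[-1] = 0.0
    x = np.linalg.solve(L, rhs)
    return x[:n_o], x[n_o:]

def simulate(n_o=300, n_a=15, r=4, seed=0):
    """One simulated panel; returns mean errors of SA, IBA and CWC."""
    rng = np.random.default_rng(seed)
    v = np.clip(rng.normal(50.0, 15.0, n_o), 0.0, 100.0)
    b = rng.normal(0.0, 15.0, n_a)
    b -= b.mean()          # assumption: centre biases to match condition (12)
    sigma = np.zeros((n_a, n_o))
    for o in range(n_o):
        panel = rng.choice(n_a, size=r, replace=False)   # r assessors per object
        sigma[panel, o] = rng.choice([5.0, 10.0, 15.0], size=r)  # profile 1:1:1
    mask = sigma > 0
    eta = rng.standard_normal((n_a, n_o))
    s = np.clip(v[None, :] + b[:, None] + sigma * eta, 0.0, 100.0)
    conf = np.where(mask, 1.0 / np.where(mask, sigma, 1.0) ** 2, 0.0)
    v_sa = (s * mask).sum(axis=0) / mask.sum(axis=0)     # simple averaging
    v_iba, _ = solve_cwc(s, mask.astype(float))          # all confidences equal
    v_cwc, _ = solve_cwc(s, conf)                        # declared confidences
    return tuple(np.abs(est - v).mean() for est in (v_sa, v_iba, v_cwc))

err_sa, err_iba, err_cwc = simulate()
print(err_sa, err_iba, err_cwc)
```

Averaging the returned errors over many seeds and repeating for several values of r gives curves of the kind reported in the figures, although the numbers from this scaled-down sketch should not be expected to match them exactly.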

Results for all three methods are presented in Figs. 2-4. The mean and maximal
errors for the simple averaging approach, the IBA method and the CWC approach are
given in Panels (a)-(d) of Figs. 2 and 3. For demonstration purposes, we use three
confidence levels rather than a continuous distribution. This allows us to clearly control
differences in confidence levels in Figs. 2 and 3 and we do so by presenting four panels
labeled (a), (b), (c) and (d). These represent different profiles, with the confidence for each
assessment randomly allocated using probabilities for high, medium and low confidences
in the ratios (a) 1:1:1, (b) 1:1:2, (c) 1:2:1, (d) 2:1:1. We observe that, for each method,
the scores become more accurate (errors decrease) as the number of assessors per object
r increases.

Figure 2: Mean errors plotted against the number r of assessors per object for the
simple averaging approach (upper curves, orange), the incomplete-block-analysis method
(middle curves, green) and the calibration-with-confidence approach (lower curves, blue).
The various panels represent different confidence profiles with probabilities for high,
medium and low confidences in the ratios (a) 1:1:1, (b) 1:1:2, (c) 1:2:1, (d) 2:1:1.

From Fig. 2(a)-(d), with only two assessors per object, the simple averaging method
gives errors averaging about 10 points. More than r = 6 assessors per object are required
to bring the mean error down to 6 points. Fisher’s IBA, however, achieves this level of
improvement with only 2 or 3 assessors. The CWC method delivers a further level of
improvement of about one point. One also notes that, for the calibration approaches,
relatively little is gained on average by employing more than four assessors per object.
This result can be compared with [15], who found that five assessors per object was
optimal in terms of accuracy over cost, for a procedure used by the Canadian Institutes
of Health Research.

Figure 3: Maximum errors plotted against the number r of assessors per object for the
simple averaging approach (upper curves, orange), the incomplete-block-analysis method
(middle curves, green) and the calibration-with-confidence approach (lower curves, blue).
The various panels represent different confidence profiles with probabilities for high,
medium and low confidences in the ratios (a) 1:1:1, (b) 1:1:2, (c) 1:2:1, (d) 2:1:1.

Fig. 3 shows that IBA also leads to significant improvements in the maximal error
values relative to those obtained through simple averaging. With two assessors per
object, maximal errors are reduced from about 45 to 30-35. The CWC approach does
not appear to significantly improve upon this. However, with 6 assessors per object the
maximal error value of about 25 delivered by the simple averaging process is reduced to
about 20 by IBA and to as low as 16 by CWC when half the assessments are done with
a high degree of confidence in the scores.

Figure 4: (a) The ratios $\langle dv \rangle_{IBA} / \langle dv \rangle_{avg}$ and $\langle dv \rangle_{CWC} / \langle dv \rangle_{avg}$ measure the mean
improved accuracies of IBA (green curves) and CWC (blue), respectively, over simple
averaging. Smaller ratios indicate a greater degree of improvement over SA.
(b) The analogous quantities for maximal errors are $\langle dv \rangle_{max,IBA} / \langle dv \rangle_{max,avg}$ and
$\langle dv \rangle_{max,CWC} / \langle dv \rangle_{max,avg}$, respectively. The four line types correspond to relative probabilities
of standard deviations of 5, 10 or 15 respectively in the ratios 1:1:1 (solid lines);
1:1:2 (long-dashed); 1:2:1 (short-dashed) and 2:1:1 (dotted).

Fig. 4 panel (a) gives the improvements achieved by the calibration methods as ratios
of the mean errors coming from Fisher’s IBA approach to the simple averaging approach,
$\langle dv \rangle_{IBA} / \langle dv \rangle_{avg}$, and of the mean errors coming from the CWC approach to the simple
averaging approach, $\langle dv \rangle_{CWC} / \langle dv \rangle_{avg}$. Smaller ratios mean greater accuracy on the part
of the calibrated approaches. Fig. 4 panel (b) gives the analogous accuracy ratios for the
maximal errors, namely $\langle dv \rangle_{max,IBA} / \langle dv \rangle_{max,avg}$ and $\langle dv \rangle_{max,CWC} / \langle dv \rangle_{max,avg}$. Fig. 4(a)
demonstrates that IBA delivers mean errors between about 60% and 80% of those coming
from the simple averaging approach, the better improvements being associated with lower
assessor numbers. This is also the most desirable configuration for realistic assessments,
as it represents employment of a minimal number of assessors per object. The CWC
approach reduces errors by about a further 10 percentage points irrespective of the
number of assessors.


4.2 Case Study 2 — Grant Proposals 

To test CWC in a realistic setting, we adapted data from a university’s internal competition
for research funding, in which 43 proposals were evaluated by a panel of 11
assessors. Each proposal was graded by two assessors, who in addition each specified a
confidence level in their grading in the form of high, medium or low. To respect confidentiality
of the competition while making the data available, we not only anonymised
the proposals and assessors but also made sufficient changes to the data (while preserving
the statistical properties) so that attribution would not be possible. The actual
panel used simple averaging, but the assessors were also asked to provide confidences
so that CWC could be applied for comparison. The panel awarded grants to the top
ten proposals. Our goals were firstly to see what differences would have been made by
use of IBA or CWC, secondly to quantify the evidence for the three models from the
data to determine which was most appropriate, and thirdly to compare the posterior
uncertainties they provide.

To apply CWC we translated the qualitative confidence levels of high, medium and
low to values $c_{ao} = A^2, 1, A^{-2}$, respectively, with A = 1.75. We chose A = 1.75 as a
reasonable guess at how the assessors used the confidence scale. One could include a
computation to infer A from the data, but our preference is for panel chairs to ask
assessors to provide uncertainties rather than qualitative confidence levels, as indicated
in Section 2, so we did not implement the inference of A.

Figure 5 (panels a, b, and c) shows the resulting values inferred by the three methods,
projected into the planes of (SA; IBA), (IBA; CWC) and (CWC; SA). Panel d of the
same figure is a Bland-Altman or Tukey mean-difference plot [16]. The correlations are
not strong, though as we would expect, the correlation of IBA with CWC is stronger
than those of either with SA. In particular, we note that the set of proposals rated in
the top ten varies substantially with the method used (Table 2). The reason for the
differences is that IBA and CWC attribute a significant range of biases to the assessors
(Table 3).

In the absence of “true” values for the proposals, how can one decide which is the 
best method to use, and hence which outcome is preferred? 

A first answer is to compare the “residuals” that the methods leave after the least
squares fit. In the case of SA this means the value of (5) obtained by taking the $v_o$ to
be the averaged scores and $b_a = 0$. For IBA, the residual is the value of (5) at the least
squares fit, taking all the $c_{ao} = 1$. For CWC, we take the value of (5) at the least squares
fit, divided by the average confidence over all assessments. The residuals are presented
in Table 4. From this point of view, we see clear improvement progressively from SA to
IBA to CWC, providing an apparently compelling argument for the use of CWC.

As IBA and CWC have more free parameters (the biases) than SA, however, one
should penalise them appropriately to make a correct comparison. Also, although normalising
the residual for CWC by the average confidence sounds sensible, it is not clear
that it is the right way to compare CWC with IBA.


Figure 5: Correlations between the results coming from the three methods applied to
Case Study 2. The three panels give the correlations between the outputs of (a) IBA and
SA; (b) CWC and IBA; (c) SA and CWC. The coefficients of determination are given
respectively by $R^2$ = 0.5701, 0.8807 and 0.3772. Panel (d) is a Bland-Altman or Tukey
mean-difference plot of differences between results from pairs of approaches
against their averages. The symbols “+” (red) compare CWC to IBA ($V_{CWC} - V_{IBA}$ vs
$(V_{IBA} + V_{CWC})/2$); “x” (green) compare IBA to SA ($V_{IBA} - V_{avg}$ vs $(V_{avg} + V_{IBA})/2$);
“o” (blue) compare SA to CWC ($V_{avg} - V_{CWC}$ vs $(V_{CWC} + V_{avg})/2$).


A principled answer is provided by Bayesian model comparison. In this procedure,
the evidence provided by the data in favour of each model is quantified, and the best
model is the one with the highest evidence. The procedure to quantify the evidence for
the three models is described in Appendix E. It depends on assumptions about the prior
probability distribution for the parameters of the models, but we took “ball” priors on the
true values and on the biases (constrained by the degeneracy-breaking condition) and a
truncated Jeffreys’ prior on the variance of the noise. In the notation of Appendix E, the
parameters for the prior probability distributions were $\sigma_o = 22.5$, $\sigma_a = 15$, $w_{max} = 900$,
$w_{min} = 1$. As the evidences come out to be small numbers (around $10^{-168}$), we took
their (natural) logarithms. The resulting log-evidences are shown in Table 5. Simple
averaging wins, but these values are so close together that we cannot draw a strong
conclusion about which method is most justified by the data. Furthermore, adjusting
the prior probability distributions and the confidence weights changes which method has
the highest evidence. We suspect that differences between the evidences for the models
would become apparent if each proposal had been evaluated by more than two assessors.

Rank   SA V_avg      IBA V_IBA     CWC V_CWC
1      OH (87.0)     OA (85.3)     OA (88.8)
2      OP (87.0)     OC (84.9)     OB (85.2)
3      OC (86.0)     OH (80.6)     OC (84.9)
4      OS (84.0)     OP (79.7)     OD (82.8)
5      OA (80.5)     OD (79.5)     OE (82.0)
6      OM (80.5)     OB (79.4)     OF (78.9)
7      OZ (80.5)     OF (78.6)     OG (78.4)
8      OF (79.5)     OE (76.9)     OH (77.3)
9      OA' (78.5)    OS (76.7)     OI (77.1)
10     OI (78.0)     OJ (76.4)     OJ (75.6)

Table 2: The 43 grant proposals are identified as OA, OB, OC, ... OZ, OA', OB', ... OP',
OQ'. Here they are ranked according to their $V_{avg}$, $V_{IBA}$ and $V_{CWC}$ values, representing
the outcomes of simple averaging, the IBA and CWC approaches. Proposals identified
by CWC as belonging to the top ten but missed by IBA are highlighted in boldface.
Proposals identified by IBA or CWC as belonging to the top ten but missed by simple
averaging are highlighted in italics. Proposals which are not in the CWC top ten are
underlined.

Assessor   Mean   St. dev.   Bias (IBA)   Bias (CWC)
AK         84.2   16.6        14.6         17.7
AJ         61.0   19.2         8.7         12.6
AI         64.6   10.0         0.0          9.7
AH         76.6    9.1        10.0          9.1
AG         71.9    6.9         8.8          8.8
AF         65.9    5.6         5.7          2.0
AE         72.3   15.5         2.8          1.1
AD         61.0   21.9        -5.0         -3.6
AC         62.3    9.6       -12.4        -15.6
AB         58.3    6.4       -12.8        -16.6
AA         49.1   12.1       -20.7        -25.2

Table 3: Assessor statistics: Assessors are labeled AA, ... AK according to increasing
CWC-biases (5th column). Here we give the mean scores they awarded, standard
deviations and IBA-biases too. The mean score awarded over all assessments was 66.9.

Method     SA     IBA    CWC
Residual   8602   4388   3156

Table 4: Residuals (scaled by mean confidence in the case of CWC).

Method         SA     IBA    CWC
log-Evidence   -385   -389   -387

Table 5: Bayesian log-Evidences.

Method        SA     IBA   CWC
Uncertainty   14.1   8.4   8.0

Table 6: Confidence-weighted root mean-square uncertainties for the values (and biases
in the cases of IBA and CWC). For SA, the weighting is according to the number $n_o$ of
assessors for object o.

A third approach is to evaluate the posterior uncertainty in the values assigned to the
objects for the three methods, as detailed in Appendix D, using (43) for IBA and CWC,
and (44) for SA. The results are given in Table 6. On this basis, the most precise results
are given by CWC. None of them are very precise, however. A posterior uncertainty of
8 means that we should consider values for the objects to have a 1/3 chance of differing
by more than 8 from the outputted values. This means that for IBA and CWC, only
the top three proposals of Table 2 are reasonably assured of being in the top ten.
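
The chance of a deviation larger than the posterior uncertainty can be checked quickly if one assumes, as in the normality assumption used for Appendix D, that the uncertainty plays the role of one standard deviation of a normal distribution. The short Python sketch below makes that assumption explicit; the function name is hypothetical.

```python
import math

def prob_outside(k=1.0):
    """P(|X - mean| > k * sd) for a normally distributed X."""
    return math.erfc(k / math.sqrt(2.0))

# Chance of lying more than one standard deviation from the estimate
print(round(prob_outside(1.0), 3))
```

The result is a little under one third, in line with the convention of Section 2 that two thirds of the probability lies within one uncertainty of the score.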

As the object of the competition was only to choose the best 10 proposals to fund, 
rather than assign values to each proposal, it might have been more appropriate to 
design just a classifier system (with a tunable parameter to make the right number in 
the “fund” class) but our goal was to use it as a test of CWC. 

The fact that three different methods with roughly equal evidence lead to drastically 
different allocation of the grants, and with large posterior uncertainties, highlights that 
better design of the panel assessment was required. Large variability of outcome even 
when just using SA but with different assessment graphs was already noted by [?]. A
moral of our analysis is that to achieve a reliable outcome, the assessment procedure
needs substantial advance design. We continue a discussion of design in Appendices C 
and F, but substantial treatment is deferred to a future paper. 

4.3 Third Context - Assessment of students 

We also tested the method on undergraduate examination results for a degree with
a flexible options system [18] and on the assessment of a multi-lecturer postgraduate
module.

In the former case, as surrogates for the confidences in the marks we took the number
of Credit Accumulation and Transfer Scheme (CATS) points for the module, which 
indicate the amount of time a student is expected to devote to the module (for readers 
used to the European credit transfer and accumulation system, 2 CATS points are 
equivalent to 1 ECTS point). The amount of assessment for a module is proportional 
to the CATS points. If it can be regarded as consisting of independent assessments of 
subcomponents, e.g. one per CATS point, with roughly equal variances, then the variance 
of the total score would be proportional to the number of CATS points. As the score is 
then normalised by the CATS points, the variance becomes inversely proportional to the 
CATS points, making confidence directly proportional to CATS points. The outcome of 
our analysis indicated significant differences in standards for the assessment of different 
modules, but as most modules counted for 15 or 18 CATS, this was not a strong test of 
the merits of including confidences in the analysis, so we do not report on it here. 
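
The variance argument above can be written out explicitly. As a sketch, assume a module worth $k$ CATS points is marked as $k$ independent sub-assessments $X_1, \dots, X_k$, each with the same variance $\tau^2$ (the symbols $k$, $X_i$ and $\tau$ are introduced here purely for illustration):

```latex
\[
\operatorname{Var}\Big(\sum_{i=1}^{k} X_i\Big) = k\,\tau^2,
\qquad
\operatorname{Var}\Big(\frac{1}{k}\sum_{i=1}^{k} X_i\Big) = \frac{\tau^2}{k},
\qquad
c = \frac{1}{\operatorname{Var}} = \frac{k}{\tau^2} \propto k .
\]
```

Under these assumptions a 30 CATS module would carry twice the confidence of a 15 CATS module.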

For the postgraduate module, there were four lecturers plus module coordinator, who 
each assessed oral and written reports for some but not all of the students, according 
to availability and expertise (except the coordinator assessed them all). Each assessor 
provided a score and an uncertainty for each assessment. The results were combined 
using our method and the resulting value for each student was reported as the final 
mark. The lecturers agreed that the outcome was fair. 

5 Discussion 

We have presented and tested a method to calibrate assessors in a panel, taking account 
of differences in confidence that they express in their assessments. From a test on
simulated data we found that Calibration with Confidence (CWC) generated closer estimates
of the true values than Additive Incomplete Block Analysis (IBA) or Simple Averaging
(SA). A test on some real data, however, provided little evidence to distinguish between
the methods, though they produced wildly different rankings, suggesting that the
assessment procedure for that context needed more robust design. Nevertheless, CWC
came ahead on posterior precision. We note that the default of assuming all assessment 
confidences to be equal results in IBA, which already represents a useful improvement 
over SA. 

One of the principal conclusions from our analysis is that achieving reliable outcomes
from the methods we tested requires good design of the assessment graph (showing which
objects are evaluated by which assessors and with what confidences).

All three methods we compared are based on least squares fitting. They may therefore 
be considered overly sensitive to outliers. An alternative approach which is less sensitive 
to outliers is based on medians rather than means. For example, Tukey’s Median Polish 
[19] is a median-based version of Fisher’s IBA. It would be good to develop a version of
it that takes confidences into account too. 

Some other drawbacks of our CWC method are: 

• it requires assessors to give reliable uncertainties; if assessors differ in their
confidence estimates the method gives higher weight to those who give higher
confidences. In particular, one needs to guard against an assessor giving unwarrantedly
high confidence for a particular assessment. There is a case for calibrating
confidences too.

• bias effects may be more subtle than just an additive effect; for example an
assessor may be more generous (or perhaps tougher) on topics in which they have
high confidence, or they may use a shorter or longer part of the scale than other
assessors.

• some organisations insist on round-number scores; this goes against the spirit of 
our approach and is awkward for assessors who may rightly wish to rate an object 
as between two of the allowed grades. The requirement is perhaps based on the 
laudable idea of not wishing to imply higher accuracy than is warranted, yet in 
our opinion this is better dealt with by reporting an uncertainty for each result on 
a continuous scale. 

• some organisations may insist that scores can not go beyond certain limits, which 
is awkward for an assessor if after evaluating several objects highly they find there 
are some they wish to rate even higher. 

There are a number of refinements which one could introduce to the core method, 
addressing some of these drawbacks. These include how to deal with different types 
of bias, different scales for confidence, different ways to remove the degeneracy in the 
equations, how to deal with the endpoints on a marking scale, and how to choose the 
assessment graph. Some suggestions are made in the Appendices, along with
mathematical treatment of the robustness of the method and of computation of the Bayesian
evidence for the models.

An advantage of our type of calibration is that it does not produce the artificial 
discontinuities across field boundaries that tend to arise if the domain is partitioned into 
fields and evaluation in each field is carried out separately. In the UK Research Assessment
Exercise 2008 for example, there is evidence that different panels had different standards
[20]. Although RAE2008 stated that cross-panel comparisons are not justified, some
universities have used such comparisons to help decide on how much to resource different 
departments. Our approach would take advantage of cross-panel referrals (which was 
part of RAE2008 for work in the overlaps between panels) to infer relative standards 
and hence to normalise the outcomes. 

We suggest that a method such as this, which takes into account declared confidences 
in each assessment, is well suited to a multitude of situations in which a number of 
objects is assessed by a panel. We acknowledge, however, that this approach requires 
an investment in training assessors to estimate their uncertainties and in constructing 
a sufficiently strongly connected assessment graph. Different panels will deal with the 
trade-off between investment of effort and accuracy of results in different ways. 
Appendix A: Scale for confidences

We motivated the model by proposing that the noise terms be of the form $\sigma_{ao}\eta_{ao}$ with
the $\eta_{ao}$ independent zero-mean random variables with unit variance, so that the $\sigma_{ao}$ are
standard deviations. Nevertheless, multiplying all the confidences by the same number
does not change the results of the least squares fit, nor our quantifications of robustness
(Appendices C and D). Thus the $\eta_{ao}$ can be taken to have any variance $w$, as long as it
is the same for all assessments. It is only ratios of confidences that have significance.

The fitting procedure can be extended to infer a best fit value for $w$. Even if the
assessors provide confidences based on assuming $w = 1$, the best fit for $w$ is not 1 in
general. Assuming independent Gaussian errors, the maximum likelihood value for $w$
comes out to be

$$w = R/N,$$

where

$$R = \sum_{ao} c_{ao}\,(s_{ao} - \hat v_o - \hat b_a)^2 \tag{18}$$

is the residual from the least squares fit $(\hat v, \hat b)$ for $(v, b)$ and $N$ is the total number of
assessments. The posterior distribution for $w$, given a prior distribution, is obtained in
Appendix D.
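
As a minimal sketch of the estimate $w = R/N$, the following Python computes the residual (18) and the resulting noise scale for a given fit. The function names, the dictionary layout and the toy numbers are illustrative assumptions, not part of the method's specification; the fitted values are simply supplied by hand.

```python
def weighted_residual(scores, conf, v_fit, b_fit):
    """R = sum over assessments of c_ao * (s_ao - v_o - b_a)^2, as in (18).

    scores and conf are dicts keyed by (assessor, object) pairs;
    v_fit and b_fit map objects and assessors to fitted values and biases.
    """
    return sum(conf[a, o] * (s - v_fit[o] - b_fit[a]) ** 2
               for (a, o), s in scores.items())


def noise_scale(scores, conf, v_fit, b_fit):
    """Maximum likelihood estimate w = R / N, N being the number of assessments."""
    return weighted_residual(scores, conf, v_fit, b_fit) / len(scores)


# Hypothetical toy data: two assessors, two objects, four assessments.
scores = {("a1", "o1"): 62.0, ("a1", "o2"): 71.0,
          ("a2", "o1"): 58.0, ("a2", "o2"): 69.0}
conf = {key: 1.0 for key in scores}
v_fit = {"o1": 60.0, "o2": 70.0}   # assumed least squares values
b_fit = {"a1": 1.5, "a2": -1.5}    # assumed least squares biases
w_hat = noise_scale(scores, conf, v_fit, b_fit)
```

Multiplying every confidence by a common factor multiplies $R$, and hence the estimate of $w$, by the same factor while leaving the fit unchanged, which is one way to see that only ratios of confidences matter.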
Appendix B: Degeneracy-breaking conditions


We can remove the degeneracy in the equations (5) and (7) in different manners from
equation (12) used here. Indeed, use of (12) can lead to an average shift from the
scores to the true values. This does not matter if only a ranking is required, but if
the actual values are important (e.g. for degree classification), then a better choice of
degeneracy-breaking condition is needed.

A preferable confidence-weighted degeneracy-breaking condition is

$$\sum_a C'_a b_a = 0, \tag{19}$$

which from (7) automatically implies $\sum_o C_o v_o = \sum_{ao} c_{ao} s_{ao}$, thus avoiding the possibility
of such systematic shifts.

From a theoretical perspective, however, the best choice of degeneracy-breaking
condition is to choose a reference value $v_{\mathrm{ref}}$ (think of a notional desired mean) and require

$$\sum_{ao} c_{ao}\,(v_o - b_a) = C\,v_{\mathrm{ref}}, \tag{20}$$

where

$$C = \sum_{ao} c_{ao}. \tag{21}$$

Using the notation in (10) and (11) this can equivalently be written as

$$\sum_o C_o v_o - \sum_a C'_a b_a = C\,v_{\mathrm{ref}}. \tag{22}$$

To reduce the possible average shift from confidence-weighted average scores to true
values, the reference value $v_{\mathrm{ref}}$ should be chosen near the confidence-weighted average
score

$$\bar s = \sum_{ao} c_{ao} s_{ao} / C. \tag{23}$$

Choosing $v_{\mathrm{ref}}$ exactly equal to $\bar s$ gives (19), which makes the confidence-weighted average
bias come out to 0 and the confidence-weighted average value come out to $\bar s$. We will
show in Appendix C, however, that the results are a factor $\sqrt{2}$ more robust to changes
in the scores if $v_{\mathrm{ref}}$ is chosen to be fixed rather than dependent on the scores.
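
The equivalence between (22) with $v_{\mathrm{ref}} = \bar s$ and condition (19) can be checked in two lines. The sketch below uses only the fact that any least squares fit of $s_{ao} = v_o + b_a$ is stationary under a uniform shift of all the $v_o$, so that $\sum_{ao} c_{ao}(s_{ao} - v_o - b_a) = 0$:

```latex
\begin{align*}
\sum_o C_o v_o + \sum_a C'_a b_a &= \sum_{ao} c_{ao} s_{ao} = C\,\bar s
  && \text{(least squares stationarity)}\\
\sum_o C_o v_o - \sum_a C'_a b_a &= C\,v_{\mathrm{ref}} = C\,\bar s
  && \text{(condition (22) with } v_{\mathrm{ref}} = \bar s\text{)}\\
\Longrightarrow\quad \sum_a C'_a b_a &= 0,
\qquad \sum_o C_o v_o = C\,\bar s .
\end{align*}
```

Subtracting the second line from the first gives the zero weighted-average bias, and adding them gives the weighted-average value $\bar s$.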

For any affine choice of degeneracy-breaking condition on the biases, $\sum_a \beta_a b_a = \gamma$,
the reduced system (16) can be solved either by replacing one of the equations by the
degeneracy-breaking condition as in Section 3 or by appending an additional unknown
$s$, adding $\beta_a s$ to the left-hand side of each equation (16), and appending the
degeneracy-breaking equation as an additional equation. The latter option has the advantage of
preserving the symmetry of the matrix representing the system of equations and hence
allowing algorithms that are twice as efficient (symmetric indefinite factorisation). The
additional unknown $s$ comes out to be 0 because of the relation $\sigma(V, B) = 0$ mentioned
earlier.
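
The bordering construction just described can be sketched in Python. Two caveats: the sketch borders the full normal equations for $(v, b)$ rather than the reduced system (16), which is not reproduced here, and it uses plain Gaussian elimination in place of a symmetric indefinite factorisation. The function names `solve` and `calibrate` and the toy data are hypothetical.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (handles indefinite systems)."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def calibrate(scores, conf, n_obj, n_ass, beta, gamma=0.0):
    """Least squares fit of s_ao = v_o + b_a subject to sum_a beta_a b_a = gamma.

    scores/conf: dicts keyed by (assessor index, object index).
    Returns (v, b, extra), where extra is the appended unknown.
    """
    n = n_obj + n_ass
    m = [[0.0] * (n + 1) for _ in range(n + 1)]
    rhs = [0.0] * (n + 1)
    for (a, o), s in scores.items():
        c = conf[a, o]
        i, j = o, n_obj + a
        m[i][i] += c           # C_o on the diagonal
        m[j][j] += c           # C'_a on the diagonal
        m[i][j] += c           # coupling between v_o and b_a
        m[j][i] += c
        rhs[i] += c * s        # confidence-weighted score total for object o
        rhs[j] += c * s        # confidence-weighted score total for assessor a
    for a in range(n_ass):     # symmetric border carrying the condition
        m[n_obj + a][n] = m[n][n_obj + a] = beta[a]
    rhs[n] = gamma
    x = solve(m, rhs)
    return x[:n_obj], x[n_obj:n], x[n]


# Noise-free toy data: assessor 0 scores objects 0, 1; assessor 1 scores objects 1, 2.
scores = {(0, 0): 65.0, (0, 1): 75.0, (1, 1): 65.0, (1, 2): 45.0}
conf = {key: 1.0 for key in scores}
v, b, extra = calibrate(scores, conf, n_obj=3, n_ass=2, beta=[2.0, 2.0])
```

Here `beta` is set to the assessor totals $C'_a$, so the condition imposed is (19). On this consistent toy data the fit recovers values (60, 70, 50) and biases (5, -5), and the appended unknown vanishes up to rounding, as the text predicts.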
Appendix C: Robustness to changes in the scores


Here we present our approach to the quantification of the robustness of our method to
small changes in the scores, using norms that take into account the confidences.

For $s = (s_{ao})_{(a,o) \in E}$, define the operator $K$ by

$$Ks = \begin{bmatrix} V \\ B \end{bmatrix}, \tag{24}$$

as a shorthand for the definitions of $V$ and $B$, so that the least squares equations
can be written as

$$L \begin{bmatrix} v \\ b \end{bmatrix} = Ks. \tag{25}$$

Thus, if a change $\Delta s$ is made to the scores, we obtain changes $\Delta v$, $\Delta b$ of magnitude
bounded by

$$\left\| \begin{bmatrix} \Delta v \\ \Delta b \end{bmatrix} \right\|_{\mathrm{results}} \le \|L^{-1}K\| \, \|\Delta s\|_{\mathrm{scores}}, \tag{26}$$

where $L^{-1}$ is defined by restricting the domain of $L$ to (12) and its range to $\sigma(V, B) = 0$,
and appropriate norms are chosen. In this appendix, we propose that appropriate choices
of norms are

$$\|\Delta s\|_{\mathrm{scores}} = \sqrt{\sum_{ao} c_{ao}\,\Delta s_{ao}^2}, \tag{27}$$

$$\|(\Delta v, \Delta b)\|_{\mathrm{results}} = \sqrt{\sum_{ao} c_{ao}\,(\Delta v_o^2 + \Delta b_a^2)} = \sqrt{\sum_o C_o\,\Delta v_o^2 + \sum_a C'_a\,\Delta b_a^2}, \tag{28}$$

and the associated operator norm from scores to results for $\|L^{-1}K\|$. With the
confidence-weighted degeneracy-breaking condition $\sum_a C'_a b_a = 0$ (19) instead of (12) we obtain

$$\|L^{-1}K\| \le \frac{\sqrt{2}}{\sqrt{\mu_2}}, \tag{29}$$

where $\mu_2$ is the second smallest eigenvalue of a certain matrix $M$ formed from the
confidences (see (33)). In particular, since $\sqrt{C_o}\,|\Delta v_o| \le \|(\Delta v, \Delta b)\|_{\mathrm{results}}$, this gives

$$|\Delta v_o| \le \frac{\sqrt{2}}{\sqrt{\mu_2 C_o}}\,\|\Delta s\|_{\mathrm{scores}}. \tag{30}$$

The factor of $\sqrt{2}$ can be removed if one switches to an ideal degeneracy-breaking
condition as in (20) of Appendix B.

As a consequence, to maximise the robustness of the results, the task for the designer
of $E$ is to make none of the $C_o$ much smaller than the others and to make $\mu_2$ significantly
larger than 0. The former is evident (no object should receive significantly less assessment
or less expert assessment than the others). The latter is the mathematical expression
of how well connected the graph $\Gamma$ (equivalently $\Gamma_A$) is. To design the graph $\Gamma$ requires
a guess of the confidence levels that assessors are likely to give to their assessments
(based on knowing their areas of expertise and their thoroughness or otherwise) and
a compromise between assigning an object to only the most expert assessors for that
object and the need to achieve a chain of comparisons between any pair of assessors.

We now go into detail, derive the above bounds and describe some computational 
shortcuts. 

One can measure the size of a change $\Delta s_{ao}$ to a score $s_{ao}$ by comparing it to the
declared uncertainty $\sigma_{ao}$. Thus we take the size of $\Delta s_{ao}$ to be $\sqrt{c_{ao}}\,|\Delta s_{ao}|$. We propose
to measure the size of an array $\Delta s$ of changes $\Delta s_{ao}$ to the scores by the square root of
the sum of squares of the sizes of the changes to each score, as in (27). Supremum or
sum-norms could also be considered but we will stick to this choice here.

It is also reasonable to measure the size of a change $\Delta v_o$ to a true value $v_o$ by
comparing it to the uncertainty implied by the sum of confidences in the scores for object
$o$. Thus the size of $\Delta v_o$ is defined to be $\sqrt{C_o}\,|\Delta v_o|$, where $C_o$ is the total confidence in
the assessment of object $o$. Similarly, we measure the size of a change $\Delta b_a$ in bias $b_a$ by
$\sqrt{C'_a}\,|\Delta b_a|$ where $C'_a$ is the total confidence expressed by assessor $a$. Finally, we
measure the size of a change $(\Delta v, \Delta b)$ to the vector of values and biases by the square
root of sum of squares of the individual sizes, as in (28).
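
The two norms can be computed directly from the confidences. The following is a minimal Python sketch of (27) and (28); the function names and the dictionary layout keyed by (assessor, object) pairs are assumptions made for illustration.

```python
import math


def scores_norm(delta_s, conf):
    """Norm (27): square root of sum_ao c_ao * (Delta s_ao)^2."""
    return math.sqrt(sum(conf[k] * d ** 2 for k, d in delta_s.items()))


def results_norm(delta_v, delta_b, conf):
    """Norm (28): sqrt(sum_o C_o Delta v_o^2 + sum_a C'_a Delta b_a^2)."""
    total_obj = {}   # C_o: total confidence received by object o
    total_ass = {}   # C'_a: total confidence expressed by assessor a
    for (a, o), c in conf.items():
        total_obj[o] = total_obj.get(o, 0.0) + c
        total_ass[a] = total_ass.get(a, 0.0) + c
    sq = sum(total_obj[o] * dv ** 2 for o, dv in delta_v.items())
    sq += sum(total_ass[a] * db ** 2 for a, db in delta_b.items())
    return math.sqrt(sq)


# Hypothetical confidences and perturbations.
conf = {("a1", "o1"): 4.0, ("a1", "o2"): 1.0, ("a2", "o2"): 1.0}
delta_s = {("a1", "o1"): 1.0, ("a1", "o2"): 2.0, ("a2", "o2"): 0.0}
size_s = scores_norm(delta_s, conf)
size_r = results_norm({"o1": 0.5, "o2": 1.0}, {"a1": 0.0, "a2": -1.0}, conf)
```

A change of 1 to a score held with confidence 4 counts as much as a change of 2 to a score held with confidence 1, which is the intended effect of weighting by declared uncertainty.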

The size of the operator $L^{-1}K$ is measured by the operator norm from scores to
results, i.e.

$$\|L^{-1}K\| = \sup_{\Delta s \ne 0} \frac{\|L^{-1}K\,\Delta s\|_{\mathrm{results}}}{\|\Delta s\|_{\mathrm{scores}}}. \tag{31}$$

The operator $L^{-1}K$ is equivalent to orthogonal projection with respect to the norm (27)
from the scores to the subspace $X$ of the form $s_{ao} = v_o + b_a$ with a degeneracy-breaking
condition to eliminate the ambiguity in direction of the vector $v_o = 1$, $b_a = -1$.

The tightest bounds in (26) are obtained by choosing the degeneracy-breaking
condition to correspond to a plane perpendicular to this vector with respect to the inner
product corresponding to equation (28). Thus we choose degeneracy-breaking condition
(20).

Theorem: For a connected graph $\Gamma$ and with the degeneracy-breaking condition (20),
the size of the change $(\Delta v, \Delta b)$ resulting from a given array of changes $\Delta s$ in scores is
bounded by

$$\|(\Delta v, \Delta b)\|_{\mathrm{results}} \le \frac{1}{\sqrt{\mu_2}}\,\|\Delta s\|_{\mathrm{scores}}, \tag{32}$$

where $\mu_2$ is the second smallest eigenvalue of the matrix

$$M = \begin{bmatrix} I_{N_O} & D \\ D^T & I_{N_A} \end{bmatrix}, \tag{33}$$

$$D^T_{ao} = \frac{c_{ao}}{\sqrt{C_o C'_a}}, \tag{34}$$

$N_A$, $N_O$ are the numbers of assessors and objects respectively, and for $k \in \mathbb{N}$, $I_k$ is the
identity matrix of rank $k$.
Proof: Firstly, the orthogonal projection in metric (27) from $s$ to the subspace $X$ never
increases length. Secondly, if $\Delta s_{ao} = \Delta v_o + \Delta b_a$ with $\sum_{ao} c_{ao}(\Delta v_o - \Delta b_a) = 0$ then

$$\|\Delta s\|^2_{\mathrm{scores}} = \sum_{ao} c_{ao}\,(\Delta v_o + \Delta b_a)^2 = g^T M g, \tag{35}$$

where $g$ is the vector with components

$$g_o = \sqrt{C_o}\,\Delta v_o, \tag{36}$$

$$g_a = \sqrt{C'_a}\,\Delta b_a. \tag{37}$$

Then, because we restricted to the orthogonal subspace to the null vector in results-norm
and $M$ is non-negative and symmetric,

$$g^T M g \ge \mu_2 \sum_i g_i^2 = \mu_2\,\|(\Delta v, \Delta b)\|^2_{\mathrm{results}},$$

where index $i$ ranges over all objects and assessors. Positivity of $\mu_2$ holds as soon as the
graph $\Gamma$ is connected, because $M$ is a transformation of the weighted graph-Laplacian
to scaled variables [21], so dividing by $\mu_2$ and taking the square root yields the result.
□

The computation of the eigenvalue $\mu_2$ of $M$ can be reduced from dimension $N_A + N_O$
to dimension $N_A$ by

Proposition: If $N_A \ge 2$, the second smallest eigenvalue $\mu_2$ of $M$ is related to the second
largest eigenvalue $\lambda_2$ of $D^T D$ by

$$\mu_2 = 1 - \sqrt{\lambda_2}. \tag{38}$$

If $N_A = 1$ and $N_O \ge 2$ then $\mu_2 = 1$. If both are 1 then $\mu_2 = 2$.

Proof: The equations for an eigenvalue-eigenvector pair $\mu$, $(v, b)$ of $M$ are

$$v + Db = \mu v \tag{39}$$

$$D^T v + b = \mu b. \tag{40}$$

Applying $D^T$ to the first equation, multiplying the second by $(1 - \mu)$, and then
substituting for $(1 - \mu) D^T v$ in the second yields

$$D^T D b = (1 - \mu)^2 b. \tag{41}$$

Thus either $b = 0$ or $(1 - \mu)^2$ is an eigenvalue $\lambda$ of $D^T D$. In the first case, equation (39)
implies $\mu = 1$, so if $\mu \ne 1$ then $(1 - \mu)^2$ is an eigenvalue of $D^T D$.

Conversely, if $(\lambda, b)$ is an eigenvalue-eigenvector pair for $D^T D$ with $\lambda \ne 0$ then $\lambda > 0$
because $D^T D$ is non-negative, so put $v = \pm Db/\sqrt{\lambda}$ to see that $(v, b)$ is an eigenvector
of $M$ with eigenvalue $\mu = 1 \pm \sqrt{\lambda}$. If $\lambda = 0$ and $Db = 0$ then $\mu = 1$ is an eigenvalue of
$M$ with eigenvector $(v, b)$ for any $v$ with $D^T v = 0$, e.g. $v = 0$.

Thus there is a two-to-one correspondence between eigenvalues $\mu$ of $M$ not equal
to 1 and positive eigenvalues $\lambda$ of $D^T D$ (counting multiplicity): $\mu = 1 \pm \sqrt{\lambda}$. Any
remaining eigenvalues are 1 for $M$ and 0 for $D^T D$. The degeneracy gives an eigenvector
$v_o = \sqrt{C_o}$, $b_a = -\sqrt{C'_a}$ of $M$ with eigenvalue 0 and it corresponds to an eigenvalue 1 of
$D^T D$. All other eigenvalues of $M$ are non-negative because $M$ is. All other eigenvalues
of $D^T D$ are less than or equal to 1 by the Cauchy-Schwarz inequality. So if the second
largest eigenvalue $\lambda_2$ of $D^T D$ (counting multiplicity) is positive then the second smallest
eigenvalue $\mu_2$ of $M$ (counting multiplicity) is $1 - \sqrt{\lambda_2}$. If $\lambda_2 = 0$ then $\mu_2 = 1$ because
existence of $\lambda_2$ implies $N_A \ge 2$ so $M$ has dimension at least 3 and we have only two
simple eigenvalues $\mu = 0$ and 2 from the simple eigenvalue 1 of $D^T D$, so $M$ must have
another one but any other value than 1 would give a positive $\lambda_2$; so the same formula
holds. If there is no second eigenvalue of $D^T D$ (because $N_A = 1$) then if $N_O \ge 2$ the
second smallest eigenvalue of $M$ must be 1 by the same argument. If both $N_A$ and $N_O$ are
1 then the second smallest eigenvalue of $M$ is the other one associated with the eigenvalue
1 of $D^T D$, namely 2. □

Note that

$$(D^T D)_{aa'} = \frac{C_{aa'}}{\sqrt{C'_a C'_{a'}}}$$

is a similarity transformation of the corresponding matrix in the main text. As examples
of second eigenvalues, putting unit confidences on the graphs in the left column of
Figure 1 we calculate $\lambda_2 = 1/3, 2/3, 1$ for cases (a), (b), (c) in the right column, giving
$\mu_2 = 1 - \sqrt{1/3}$, $1 - \sqrt{2/3}$, $0$, respectively.
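
The relation (38) can be checked numerically on a small example. The sketch below uses a hypothetical graph (not one of the cases in Figure 1) with two assessors and three objects and unit confidences. Because $D^T D$ is then $2 \times 2$, its eigenvalues have a closed form, and the eigenvector construction $v = -Db/\sqrt{\lambda}$ from the proof is used to confirm that $1 - \sqrt{\lambda_2}$ is an eigenvalue of $M$. It does not by itself show that this is the second smallest eigenvalue.

```python
import math

# Hypothetical assessment graph with unit confidences:
# assessor 0 scores objects 0 and 1, assessor 1 scores objects 1 and 2.
conf = {(0, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0, (1, 2): 1.0}
n_ass, n_obj = 2, 3

c_obj = [sum(c for (a, o), c in conf.items() if o == j) for j in range(n_obj)]
c_ass = [sum(c for (a, o), c in conf.items() if a == i) for i in range(n_ass)]

# D has one row per object and one column per assessor, as in (34).
D = [[conf.get((a, o), 0.0) / math.sqrt(c_obj[o] * c_ass[a])
      for a in range(n_ass)] for o in range(n_obj)]

# D^T D is 2 x 2 here, so its eigenvalues have a closed form.
p = sum(D[o][0] ** 2 for o in range(n_obj))
q = sum(D[o][0] * D[o][1] for o in range(n_obj))
r = sum(D[o][1] ** 2 for o in range(n_obj))
half_gap = math.sqrt(((p - r) / 2) ** 2 + q ** 2)
lam1, lam2 = (p + r) / 2 + half_gap, (p + r) / 2 - half_gap
mu2 = 1 - math.sqrt(lam2)

# Eigenvector of D^T D for lam2 (valid since q != 0), lifted to M.
b = [q, lam2 - p]
v = [-sum(D[o][a] * b[a] for a in range(n_ass)) / math.sqrt(lam2)
     for o in range(n_obj)]

# Apply M = [[I, D], [D^T, I]] to g = (v, b) and compare with mu2 * g.
Mv = [v[o] + sum(D[o][a] * b[a] for a in range(n_ass)) for o in range(n_obj)]
Mb = [b[a] + sum(D[o][a] * v[o] for o in range(n_obj)) for a in range(n_ass)]
defect = max(abs(x - mu2 * y) for x, y in zip(Mv + Mb, v + b))
```

For this graph the largest eigenvalue of $D^T D$ is 1 (the degenerate direction) and the second is 1/2, so the candidate $\mu_2$ is $1 - \sqrt{1/2}$; the quantity `defect` measures how far $Mg$ is from $\mu_2 g$ and should vanish up to rounding.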

Finally, a user may prefer to use the degeneracy-breaking condition (19) rather than
(20), perhaps out of uncertainty about what value of $v_{\mathrm{ref}}$ to use. Or a user may be happy
to use (20) with $v_{\mathrm{ref}}$ equal to the confidence-weighted average score, but wants $v_{\mathrm{ref}}$ to
follow this average score if changes are made to the scores. That comes out equivalent
to using (19). So we extend our discussion of robustness to treat this case. We find it
makes the bounds increase by a factor of only $\sqrt{2}$.

Proposition: For $\Gamma$ connected and using degeneracy-breaking condition (19), the size
of $(\Delta v, \Delta b)$ resulting from changes $\Delta s$ to the scores is at most $\sqrt{2/\mu_2}\,\|\Delta s\|_{\mathrm{scores}}$.

Proof: If the degeneracy-breaking condition (20) gives a change $(\Delta v, \Delta b)$ for a change
$\Delta s$ to the scores, then switching to degeneracy-breaking condition (19) just adds an
amount $k$ of the null vector $n = (1, -1)$ to achieve $\sum_a C'_a (\Delta b_a - k) = 0$, i.e.

$$k = \frac{\sum_a C'_a\,\Delta b_a}{C}. \tag{42}$$

In the results metric, the null vector has length $\sqrt{\sum_o C_o + \sum_a C'_a} = \sqrt{2C}$. Thus the
correction has length $|k|\sqrt{2C} = \sqrt{2/C}\,\left|\sum_a C'_a \Delta b_a\right|$. Using the condition (20) we can write
$\sum_a C'_a \Delta b_a = \frac{1}{2}\left(\sum_a C'_a \Delta b_a + \sum_o C_o \Delta v_o\right)$, which one can recognise as one half of the inner
product of $(1, 1)$ with $(\Delta v, \Delta b)$ in results-norm, so it is bounded by $\sqrt{C/2}\,\|(\Delta v, \Delta b)\|$.
Thus the length of the correction vector is at most that of $(\Delta v, \Delta b)$. The correction is
perpendicular to $(\Delta v, \Delta b)$, thus the vector sum has length at most $\sqrt{2}\,\|(\Delta v, \Delta b)\|$. □
One may also ask about robustness with respect to changes in the confidences $c_{ao}$.
If an assessor declares extra high confidence for an evaluation, for example, that can
significantly skew the resulting $v$ and $b$. The analysis is more subtle, however, because
of how the $c_{ao}$ appear in the equations and we do not treat it here.


Appendix D: Posterior probability distribution 


Another point of view on robustness is the Bayesian one. From a prior probability on
$(v, b)$ and a model for the $\eta_{ao}$, one can infer a posterior probability for $(v, b)$, whose
inverse width tells one how robust the inference is.

In the case of flat prior on $(v, b)$, prescribed $w$, Gaussian noise, and an affine
degeneracy-breaking condition, the posterior is Gaussian with mean at the value solving
equations (5), (7) and the degeneracy-breaking condition, and with covariance matrix
related to $L^{-1}$. Specifically, the posterior probability density for $(v, b)$ is proportional to

$$\exp\left(-\frac{S}{2w}\right),$$

constrained to the degeneracy-breaking hyperplane, where

$$S = \sum_{ao} c_{ao}\,(s_{ao} - v_o - b_a)^2.$$

Using (35) and (18), this can be written as

$$\exp\left(-\frac{1}{2w}\left(g^T M g + R\right)\right),$$

with $(\Delta v, \Delta b)$ being the deviations of $(v, b)$ from the least squares fit. Thus the covariance
matrix in these scaled variables is $w M^{-1}$, where for degeneracy-breaking condition $\gamma^T g = \kappa$,
$M^{-1}$ is interpreted as the limit as $t \to \infty$ of $(M + t\gamma\gamma^T)^{-1}$. Using the
degeneracy-breaking condition (20) or equivalently (22) for which $\gamma$ is in the null direction of $M$
and diagonalising the matrix, we obtain widths $\sqrt{w/\mu_j}$ for the posterior on $g$ in the
eigendirections of $M$, where $\mu_j$ are the positive eigenvalues of $M$. Thus the robustness
of the inference is again determined by $\mu_2$, but scaled by $\sqrt{w}$.

A slightly more sophisticated approach is to consider $w$ to be unknown also. Given
a prior density $\rho$ for $w$ (which could be peaked around 1 if the assessors are assigning
confidences via uncertainties, but following Jeffreys would be better chosen to be $1/w$
if there is no information about the scale for the confidences), the posterior density for
$(w, v, b)$ is proportional to

$$\rho(w)\,w^{-N/2} \exp\left(-\frac{S}{2w}\right),$$

where again $N$ is the number of assessments. The maximum of the posterior probability
density is determined by the least squares fit for $(v, b)$ (which is independent of $w$) and
the following equation for $w$:

$$\frac{\rho'(w)}{\rho(w)} - \frac{N}{2w} + \frac{S}{2w^2} = 0.$$
For $N$ large, the peak of the posterior has $w$ near the previously determined maximum
likelihood value $w = R/N$. For example, taking Jeffreys’ prior, the peak is at $w =
R/(N - 2)$. Integrating over $w$ (with Jeffreys’ prior) one finds the marginal posterior for
$(v, b)$ to be proportional to

$$(g^T M g + R)^{-N/2}.$$

Incorporating an affine degeneracy-breaking condition, this is a $(N_O + N_A - 1)$-variate
Student distribution with $\nu = N - N_O - N_A + 1$ degrees of freedom. Its covariance
matrix is $w^* M^{-1}$ with

$$w^* = \frac{R}{\nu - 2}$$

and $M^{-1}$ interpreted by imposing the chosen degeneracy-breaking condition as above.

So for the degeneracy-breaking condition (20), the robustness of the inference is
given by widths $\sqrt{w^*/\mu_j}$ for $j \ge 2$, in the eigendirections of $M$ on $g$. In particular, the
confidence-weighted root mean square uncertainty $\sigma$ for the components of the vector
$(v, b)$ is

$$\sigma = \sqrt{\frac{w^*}{2C} \sum_{j \ge 2} \frac{1}{\mu_j}} = \sqrt{\frac{R\,\mathrm{Tr}\,M^{-1}}{2(\nu - 2)C}}, \tag{43}$$

where Tr denotes the trace and, again, $M^{-1}$ is interpreted by restricting to the
degeneracy-breaking plane. Marginal posteriors for each $v_o$ and $b_a$ can be extracted, but it must
be understood that in general they are significantly correlated. One way to do this in
the case of degeneracy-breaking condition (20) is to find the orthogonal matrix $O$ to
diagonalise $M$ as $O^T D O$ with $D = \mathrm{diag}(\mu_j)$, and then the posterior variance of $g_i$ is
$w^* \sum_{j \ge 2} O_{ji}^2/\mu_j$, but there may be ways to evaluate it without diagonalising $M$.
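
Given the residual and the positive eigenvalues of $M$, the quantities in this appendix reduce to a few lines of arithmetic. The following Python sketch evaluates $\nu$, $w^*$, the widths $\sqrt{w^*/\mu_j}$ and the uncertainty $\sigma$ of (43). The function name, its argument list and the example eigenvalues are hypothetical, and the eigenvalues are assumed to have been computed elsewhere.

```python
import math


def posterior_summary(residual, n_assessments, n_obj, n_ass, total_conf, mus):
    """Widths sqrt(w*/mu_j) and the rms uncertainty sigma of (43).

    mus: the positive eigenvalues mu_j (j >= 2) of M, supplied by the caller.
    total_conf: C, the sum of all confidences c_ao.
    """
    nu = n_assessments - n_obj - n_ass + 1       # degrees of freedom
    if nu <= 2:
        raise ValueError("need nu > 2 for a finite posterior covariance")
    w_star = residual / (nu - 2)
    widths = [math.sqrt(w_star / mu) for mu in mus]
    sigma = math.sqrt(w_star * sum(1.0 / mu for mu in mus) / (2 * total_conf))
    return nu, w_star, widths, sigma


# Made-up inputs: 20 unit-confidence assessments of 6 objects by 4 assessors,
# with an invented spectrum of 9 positive eigenvalues.
mus = [0.5, 1.5, 0.5, 1.5, 0.5, 1.5, 1.0, 1.0, 2.0]
nu, w_star, widths, sigma = posterior_summary(80.0, 20, 6, 4, 20.0, mus)
```

Small eigenvalues dominate the sum in (43), so a poorly connected assessment graph (small $\mu_2$) inflates $\sigma$ even when the residual is modest.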

For the case of simple averaging, the root mean-square posterior uncertainty in the
values, weighted by the numbers $n_o$ of assessors for object $o$, is


^ “ \/N-No’ (44) 

where $R$ is defined in (47) of Appendix E. This can be derived in an analogous fashion
to (43) via a Student distribution again, but with $M = I_{N_O}$.


Appendix E: Model comparison 

Here we describe the method used in Case Study 2 to compare the three models. 

Bayesian model comparison is based on computing how much evidence there is for
each proposed model, e.g. Ch.28 of [22]. The evidence for a model $M$ given data $D$ is
$P(D|M)$. Given strength of belief $P(M)$ in model $M$ prior to the data (relative to other
models), one can multiply it by the evidence to obtain the posterior strength of belief
in model $M$. It is convenient to replace multiplication by addition, thus we define the
log-evidence

$$LE(M|D) = \log P(D|M).$$
If the model $M$ has free parameters $\mu$ then

$$P(D|M) = \int P(D|M, \mu)\,P_M(\mu)\,d\mu,$$

where $P_M(\mu)$ is a prior probability density on $\mu$.

Let there be $N_O$ objects, $N_A$ assessors, let $s_{ao}$ be the score returned by assessor $a$
for object $o$, $c_{ao}$ the confidence in this score in the case of calibration with confidence, $s$
be the collection of scores and $N$ be their number.

First we compute the evidence for simple averaging (SA). Then we treat calibration
with confidence (CWC) and lastly incomplete block analysis (IBA) because it is a special
case of CWC.


— Simple Averaging 

For simple averaging (SA), the model is that

$$s_{ao} = v_o + \varepsilon_{ao}$$

for some unknown vector $v$ of “true” values $v_o$, with $\varepsilon_{ao}$ iid normal $N(0, w)$ for some
unknown variance $w$. Then the probability density for the scores $s$ is

$$P(s|v, w) = \prod_{ao} \frac{e^{-(s_{ao} - v_o)^2/2w}}{\sqrt{2\pi w}} = (2\pi w)^{-N/2} \exp\left(-\frac{1}{2w} \sum_{ao} (s_{ao} - v_o)^2\right),$$

with the product and sum being over the assessments that were carried out.

To work out the evidence for SA the model must include a prior probability density
for $v$ and $w$. The simplest proposal would be $\Delta^{-N_O} L^{-1} w^{-1}$ on $v_o \in [v_{\min}, v_{\max}]$, $w \in
[w_{\min}, w_{\max}]$, where $\Delta = v_{\max} - v_{\min}$ and $L = \log(w_{\max}/w_{\min})$. This is the product of
a “box” prior on $v$ and Jeffreys’ prior on $w$ (truncated to an interval and normalised).
For comparison with the other models, however, it is easier to replace the box prior on
$v$ by a “ball” prior, giving

$$P_{SA}(v, w) = \frac{1}{Z_O L w}$$

on

$$\sum_o (v_o - v_{\mathrm{ref}})^2 \le N_O \sigma_O^2, \tag{45}$$

for some anticipated average score $v_{\mathrm{ref}}$ and upper estimate of the width $\sigma_O$ of the
distribution of values $v_o$. The normalisation is

$$Z_O = \frac{(\pi N_O \sigma_O^2)^{N_O/2}}{\Gamma(N_O/2 + 1)},$$

where $\Gamma$ is the Gamma function. For $w_{\min}$ it is reasonable to choose $u^2$ where $u$ is the
smallest change any assessor could contemplate. For $w_{\max}$ it is reasonable to choose $\sigma_O^2$.
For each object $o$,

$$\sum_a (s_{ao} - v_o)^2 = n_o (v_o - \bar s_o)^2 + R_o,$$

where $n_o$ is the number of assessors for object $o$, $\bar s_o$ is the mean of their scores, and the
residual

$$R_o = \sum_a (s_{ao} - \bar s_o)^2. \tag{46}$$

Thus

$$P(s|v, w)\,P_{SA}(v, w) = \frac{1}{Z_O L w} (2\pi w)^{-N/2}\, e^{-\frac{1}{2w} \sum_o n_o (v_o - \bar s_o)^2}\, e^{-R/2w},$$

where

$$R = \sum_o R_o. \tag{47}$$

To integrate this over $v$ and $w$, we assume the bulk of the probability distribution lies
in the product of the ball and the interval, and so approximate by extending the range
of integration to $\mathbb{R}^{N_O} \times (0, \infty)$. Integrating the exponential over $v_o$ produces a factor

$$\sqrt{\frac{2\pi w}{n_o}}.$$

Thus, integrating over all components of $v$ yields

$$\frac{1}{Z_O L w} (2\pi w)^{-(N - N_O)/2}\, e^{-R/2w} \prod_o n_o^{-1/2}.$$

Integrating this over $w$, we obtain the evidence

$$P(SA|s) = \frac{1}{Z_O L}\,\Gamma\left(\frac{N - N_O}{2}\right) (\pi R)^{-(N - N_O)/2} \prod_o n_o^{-1/2}$$

and the log-evidence

$$LE(SA|s) = \log \Gamma\left(\frac{N - N_O}{2}\right) - \frac{N - N_O}{2} \log \pi R - \frac{1}{2} \sum_o \log n_o - \log Z_O - \log L.$$
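
The closed form for the SA log-evidence is straightforward to evaluate. The Python sketch below follows the formula above term by term. The function name, the data layout (a dict from object to list of scores) and the hyperparameter values in the example are assumptions for illustration, and the function requires $N > N_O$ and a positive residual.

```python
import math


def log_evidence_sa(scores, sigma_o, w_min, w_max):
    """Log-evidence for simple averaging, following the closed form above.

    scores: dict mapping each object to the list of scores it received.
    sigma_o, w_min, w_max: prior hyperparameters chosen by the user.
    """
    n_obj = len(scores)
    n = sum(len(s) for s in scores.values())
    residual = 0.0
    sum_log_n = 0.0
    for s in scores.values():
        mean = sum(s) / len(s)
        residual += sum((x - mean) ** 2 for x in s)   # R_o of (46)
        sum_log_n += math.log(len(s))
    k = (n - n_obj) / 2
    # log Z_O for the ball prior (45)
    log_z = (n_obj / 2) * math.log(math.pi * n_obj * sigma_o ** 2) \
        - math.lgamma(n_obj / 2 + 1)
    log_l = math.log(math.log(w_max / w_min))         # log L
    return (math.lgamma(k) - k * math.log(math.pi * residual)
            - 0.5 * sum_log_n - log_z - log_l)


# Hypothetical data: two objects, each scored twice.
le = log_evidence_sa({"o1": [60.0, 64.0], "o2": [70.0, 74.0]},
                     sigma_o=10.0, w_min=1.0, w_max=100.0)
```

Working with `math.lgamma` rather than the Gamma function itself avoids overflow when the number of assessments is large.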

— Calibration with Confidence 

For Calibrate with Confidence (CWC), the model is

$$s_{ao} = v_o + b_a + \sigma_{ao}\eta_{ao},$$

for some unknown vectors $v$ of true values $v_o$, and $b$ of assessor biases $b_a$, with $\eta_{ao}$
iid normal $N(0, w)$ for some unknown variance $w$. The uncertainties $\sigma_{ao}$ correspond
to confidences $c_{ao}$ by $\sigma_{ao} = 1/\sqrt{c_{ao}}$, which are considered as given (one could propose
a generative model for them too, but that would require further analysis). Then the
probability density for $s$ is

$$P(s|v, b, w) = \prod_{ao} \frac{e^{-c_{ao}(s_{ao} - v_o - b_a)^2/2w}}{\sqrt{2\pi w/c_{ao}}} = (2\pi w)^{-N/2}\, e^{-\frac{1}{2w} \sum_{ao} c_{ao}(s_{ao} - v_o - b_a)^2} \prod_{ao} c_{ao}^{1/2}.$$

For prior probability density over the parameters $v, b, w$, we want to build in a
degeneracy-breaking condition. We used $\sum_a b_a = 0$ in our calculation, thus we take
prior “density”

$$P_{CWC}(v, b, w) = \frac{1}{Z_O Z_A L w}\,\delta\Big(\sum_a b_a\Big)$$

on the product of the balls (45) and $\sum_a b_a^2 \le N_A \sigma_A^2$ and interval $[w_{\min}, w_{\max}]$, where $\delta$
is the delta function. Here, $\sigma_A$ is an estimated upper bound for the standard deviation
of the biases, and the normalisation is

$$Z_A = \frac{(\pi N_A \sigma_A^2)^{(N_A - 1)/2}}{\Gamma((N_A + 1)/2)}.$$

Note that the interpretation of $w$ is not the same as for SA, so one might choose a different
prior for it. For example, if the $\sigma_{ao}$ are fairly accurate values for the uncertainties in
the scores then the prior for $w$ should be peaked around $w = 1$, but if they are on an
undetermined scale a truncated Jeffreys prior is sensible. The only thing is that one
might want to choose a different interval for it, but for application to IBA where the
$c_{ao} = 1$ or to CWC if the $c_{ao}$ are on a scale centred around 1, such as we have used to
translate the qualitative high/medium/low confidence ratings, the same interval should
be reasonable. Similarly, one might want to use a different value for $\sigma_O$ if one believes
that the spread in values is more due to variation in assessor bias than true value, but
in our case we think it reasonable to use the same $\sigma_O$.

Thus

$$P(s|v, b, w)\,P_{CWC}(v, b, w) = \frac{1}{Z_O Z_A L w} (2\pi w)^{-N/2}\, e^{-\frac{1}{2w} \sum_{ao} c_{ao}(v_o + b_a - s_{ao})^2}\, \delta\Big(\sum_a b_a\Big) \prod_{ao} c_{ao}^{1/2}.$$

Again we assume the bulk of this lies in the product of balls and interval, so we
approximate its integral by extending the domains to infinity. Now

$$\sum_{ao} c_{ao}(v_o + b_a - s_{ao})^2 = h^T A h + R,$$

where $h$ is the vector with $N_O + N_A$ components, $h_o = (v_o - \hat v_o)$, $h_a = (b_a - \hat b_a)$, $(\hat v, \hat b)$ is
any least squares fit to this model (without loss of generality satisfying the
degeneracy-breaking condition), the residual $R$ is now $\sum_{ao} c_{ao}(s_{ao} - \hat v_o - \hat b_a)^2$ (as in (18)) and $A$ is
the matrix with block form

$$A = \begin{bmatrix} \mathrm{diag}(C_o) & c^T \\ c & \mathrm{diag}(C'_a) \end{bmatrix}.$$
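Choose one assessor, say $n$, and integrate over $b_n$. This yields

$$\frac{1}{Z_O Z_A L w} (2\pi w)^{-N/2}\, e^{-\frac{1}{2w}(\tilde h^T \tilde A \tilde h + R)} \prod_{ao} c_{ao}^{1/2},$$

with $\tilde h$ being the remaining components of $h$ and

$$\tilde A = \begin{bmatrix} \mathrm{diag}(C_o) & \tilde c^T \\ \tilde c & \mathrm{diag}(C'_a) + C'_n E \end{bmatrix},$$

where $\tilde c_{ao} = c_{ao} - c_{no}$ and $E_{aa'} = 1$, restricted to $a, a' \ne n$, which takes into account that
$b_n = -\sum_{a \ne n} b_a$.

Thus the integral over $\tilde h$ is

$$\frac{1}{Z_O Z_A L w} (2\pi w)^{-\nu/2}\, e^{-R/2w}\, \frac{\prod_{ao} c_{ao}^{1/2}}{\sqrt{\det \tilde A}},$$

where $\nu = N - N_O - N_A + 1$.

Finally, we integrate over $w$ to obtain

$$P(CWC|s) = \frac{1}{Z_O Z_A L} (\pi R)^{-\nu/2}\, \Gamma\left(\frac{\nu}{2}\right) \frac{\prod_{ao} c_{ao}^{1/2}}{\sqrt{\det \tilde A}},$$

and the log-evidence is

$$LE(CWC|s) = \log \Gamma\left(\frac{\nu}{2}\right) - \frac{\nu}{2} \log \pi R + \frac{1}{2} \sum_{ao} \log c_{ao} - \frac{1}{2} \log \det \tilde A - \log Z_O - \log Z_A - \log L.$$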

— Incomplete Block Analysis 

The model for incomplete block analysis (IBA) is the same as for CWC but taking the
confidences $c_{ao} = 1$ for all the assessments. Thus the log-evidence for IBA given the
scores $s$ is

$$LE(IBA|s) = \log \Gamma\left(\frac{\nu}{2}\right) - \frac{\nu}{2} \log \pi R - \frac{1}{2} \log \det \tilde A - \log Z_O - \log Z_A - \log L,$$

with the appropriate changes to $R$ and $\tilde A$.


Appendix F: Potential Refinements to the method 

One could develop refinements to the basic model (1). For example, assessors might
have not only an additive bias but also different scales, so for example

$$s_{ao} = \lambda_a v_o + b_a + \sigma_{ao}\eta_{ao}. \tag{48}$$

Fitting $\lambda$, $v$, $b$ is more complicated, however, than just $v$, $b$.

An assessor may have a bias correlated with their confidence [23] or with some other
feature like familiarity [23]. Assessors may like to give round-number scores or the
organisers of the panel may insist on them. Assessors may have different scales for
confidence, so their confidences may need calibrating as well as their scores.

Another problem is that often assessors are asked to assign scores in a fixed range
$[A, B]$, e.g. 1 to 10. Then any model for bias really ought to be nonlinear to respect the
endpoints. One way to treat this is to apply a nonlinear transformation to map a slightly
larger interval $(a, b)$ onto $\mathbb{R}$, e.g.

$$x \mapsto \frac{x - \frac{1}{2}(a + b)}{(b - x)(x - a)} \qquad (49)$$

or

$$x \mapsto \log \frac{b - x}{x - a}, \qquad (50)$$


apply our method to the transformed scores, scaling the confidences by the inverse square 
of the derivative of the transformation, and then apply the inverse transformation to the 
“true” values. On the other hand, it may be inadvisable to specify a fixed range because 
it requires an assessor to have knowledge of the range of the objects before starting 
scoring. Thus one could propose asking assessors to use any real numbers and then use 
equation (48) to extract true values $v$. A simpler strategy that might work nearly as
well is to allow assessors to use any positive numbers but then to take logarithms and
fit the basic model to the log-scores. The assessor biases would then be like logarithms of
exchange rates. The confidences would need translating appropriately too. 
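
The transformation step for a fixed score range can be sketched as follows in Python, taking the logarithmic map to be $x \mapsto \log((b - x)/(x - a))$. The function names and the interval $(0.5, 10.5)$ around a 1 to 10 scale are hypothetical choices for illustration. Since a standard deviation is multiplied by the absolute derivative of the map, the confidence (an inverse variance) is divided by the squared derivative.

```python
import math

def to_real_line(x, a, b):
    """Map a score x in (a, b) to the real line: x -> log((b - x)/(x - a))."""
    return math.log((b - x) / (x - a))

def from_real_line(y, a, b):
    """Inverse of to_real_line, mapping a real value back into (a, b)."""
    e = math.exp(y)
    return (b + a * e) / (1.0 + e)

def transformed_confidence(c, x, a, b):
    """Scale confidence c by the inverse square of the derivative at x."""
    deriv = -(b - a) / ((b - x) * (x - a))
    return c / deriv ** 2

# Hypothetical example: a score of 8 on a 1 to 10 scale with confidence 2.
a, b = 0.5, 10.5
score, conf = 8.0, 2.0
y = to_real_line(score, a, b)
c_y = transformed_confidence(conf, score, a, b)
x_back = from_real_line(y, a, b)
print(y, c_y, x_back)
```

The calibration would then be run on the pairs `(y, c_y)`, with `from_real_line` applied to the resulting values.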

One issue with our method is that the effect of an assessor who assesses only one 
object is only to determine their own bias, apart from an overall shift along the null vector 
(v,b) = (1,-1) for the rest. To rectify this one could incorporate a prior probability 
distribution for the biases (indeed, this was done by [8] in the form of a regulariser).
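
As a rough illustration of this remedy, the Python sketch below adds a quadratic penalty $\rho \sum_a b_a^2$ on the biases to the weighted least squares fit, which corresponds to a zero-mean Gaussian prior on each bias. The penalty form, the strength `rho`, the function name `fit_with_bias_prior` and the example scores are assumptions made for illustration and are not necessarily the regulariser used in [8].

```python
import numpy as np

def fit_with_bias_prior(assessments, n_assessors, n_objects, rho):
    """Minimise sum c*(s - v_o - b_a)^2 + rho * sum b_a^2 over v and b.

    rho > 0 is a hypothetical prior strength; it also removes the
    degeneracy along (v, b) = (1, -1).
    """
    N = len(assessments)
    X = np.zeros((N, n_objects + n_assessors))
    s = np.zeros(N)
    c = np.zeros(N)
    for i, (a, o, score, conf) in enumerate(assessments):
        X[i, o] = 1.0
        X[i, n_objects + a] = 1.0
        s[i], c[i] = score, conf
    # Penalise biases only, not object values.
    penalty = np.diag([0.0] * n_objects + [rho] * n_assessors)
    lhs = X.T @ (c[:, None] * X) + penalty
    rhs = X.T @ (c * s)
    sol = np.linalg.solve(lhs, rhs)
    return sol[:n_objects], sol[n_objects:]

# Two assessors score both objects; a third assessor scores object 0 only.
base = [(0, 0, 5.0, 1.0), (0, 1, 7.0, 1.0), (1, 0, 6.0, 1.0), (1, 1, 8.0, 1.0)]
extra = base + [(2, 0, 9.0, 1.0)]
v_without, _ = fit_with_bias_prior(base, 2, 2, rho=0.1)
v_with, b_with = fit_with_bias_prior(extra, 3, 2, rho=0.1)
print(v_without, v_with, b_with)
```

With the penalty in place, the high score from the single-object assessor is no longer absorbed entirely into that assessor's bias, so it raises the fitted value of object 0.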

An interesting future project is to design the graph $\Gamma$ optimally, given advance guesses
of confidences and constraints (like conflicts of interest) or costs for the number of 
assessments per assessor. “Optimality” would mean to achieve maximum precision or 
robustness of the resulting values. For instance, in each case of Figure [T] each assessor 
has the same amount of work and each object receives the same amount of attention, but 

(a) achieves full connectivity with a resulting value for $\mu_2$ of $1 - 1/\sqrt{3} \approx 0.42$, whereas

(b) achieves moderate connectivity and a smaller value of $\mu_2 = 1 - \sqrt{2/3} \approx 0.18$, and

(c) is not even connected and has $\mu_2 = 0$.